Abstract: This paper concerns the construction of tests for
universal hypothesis testing problems, in which the alternate
hypothesis is poorly modeled and the observation space is large.

The mismatched universal test is a feature-based technique for
this purpose. Prior work has shown that its finite-observation
performance can be much better than that of the (optimal) Hoeffding
test, and that good performance depends crucially on the choice of
features. The contributions of this paper include:

(i) We obtain bounds on the number of $\varepsilon$-distinguishable
distributions in an exponential family.

(ii) This motivates a new framework for feature extraction, 
cast as a rank-constrained optimization problem. 

(iii) We obtain a gradient-based algorithm to solve the
rank-constrained optimization problem and prove its local
convergence.

Keywords: Universal test, mismatched universal test, hypothesis 
testing, feature extraction, exponential family 

I. Introduction

A. Universal Hypothesis Testing 

In universal hypothesis testing, the problem is to design
a test to decide in favor of one of two hypotheses $H_0$
and $H_1$, under the assumption that we know the probability
distribution $\pi^0$ under $H_0$, but have uncertainties about the
probability distribution $\pi^1$ under $H_1$. One of the applications
that motivates this paper is detecting abnormal behaviors
[1]: In the applications envisioned, the amount of data from
abnormal behavior is limited, while there is a relatively large
amount of data for normal behavior.

To be more specific, we consider the hypothesis testing
problem in which a sequence of observations $Z_1^n :=
(Z_1, \dots, Z_n)$ from a finite observation space $\mathsf{Z}$ is given, where
$n$ is the number of samples. The sequence $Z_1^n$ is assumed to be
i.i.d. with marginal distribution $\pi^i \in \mathcal{P}(\mathsf{Z})$ under hypothesis
$H_i$ ($i = 0, 1$), where $\mathcal{P}(\mathsf{Z})$ is the probability simplex on $\mathsf{Z}$.

Hoeffding [2] introduced a universal test, defined using the
empirical distributions and the Kullback-Leibler divergence.
The empirical distributions $\{\Gamma^n : n \ge 1\}$ are defined as
elements of $\mathcal{P}(\mathsf{Z})$ via

$$\Gamma^n(A) = \frac{1}{n} \sum_{k=1}^{n} \mathbb{I}\{Z_k \in A\}, \qquad A \subset \mathsf{Z}.$$

The Kullback-Leibler divergence for two probability distributions
$\mu^1, \mu^0 \in \mathcal{P}(\mathsf{Z})$ is defined as

$$D(\mu^1 \| \mu^0) = \langle \mu^1, \log(\mu^1/\mu^0) \rangle,$$

where the notation $\langle \mu, f \rangle$ denotes expectation of $f$ under the
distribution $\mu$, i.e., $\langle \mu, f \rangle = \sum_z \mu(z) f(z)$. The Hoeffding test
is the binary sequence

$$\phi^H_n = \mathbb{I}\{D(\Gamma^n \| \pi^0) \ge \eta\},$$

where $\eta$ is a nonnegative constant. The test decides in favor
of $H_1$ when $\phi^H_n = 1$.
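
As an illustration of the definitions above, the following minimal Python sketch computes an empirical distribution, the KL divergence, and the resulting Hoeffding decision. The function names, and the encoding of the observation space as the integers 0, ..., |Z|-1, are assumptions made for this example only.

```python
import math
from collections import Counter

def empirical_distribution(samples, alphabet_size):
    """Empirical distribution Gamma^n of samples taking values in 0..alphabet_size-1."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(z, 0) / n for z in range(alphabet_size)]

def kl_divergence(mu, pi):
    """D(mu || pi) = <mu, log(mu/pi)>, with the convention 0 log 0 = 0."""
    total = 0.0
    for m, p in zip(mu, pi):
        if m > 0.0:
            if p == 0.0:
                return math.inf  # mu is not absolutely continuous w.r.t. pi
            total += m * math.log(m / p)
    return total

def hoeffding_test(samples, pi0, eta):
    """Return 1 (decide H1) when D(Gamma^n || pi0) >= eta, else 0."""
    gamma = empirical_distribution(samples, len(pi0))
    return 1 if kl_divergence(gamma, pi0) >= eta else 0
```

For example, with a uniform null distribution on two symbols, a balanced sample gives zero divergence and the test returns 0, while a sample consisting of a single repeated symbol gives divergence log 2.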

It was demonstrated in [3] that the performance of the
Hoeffding test is characterized by both its error exponent
and the variance of the test statistics. We summarize this in
Theorem 1.1. The error exponents are defined for a test sequence
$\phi := \{\phi_1, \phi_2, \dots\}$ adapted to $Z_1^n$ as

$$J^0_\phi := \liminf_{n\to\infty} -\frac{1}{n} \log(\pi^0\{\phi_n = 1\}),$$

$$J^1_\phi := \liminf_{n\to\infty} -\frac{1}{n} \log(\pi^1\{\phi_n = 0\}).$$

Theorem 1.1: 1) The Hoeffding test achieves the optimal
error exponent $J^1_\phi$ among all tests satisfying a given
constant bound $\eta \ge 0$ on the exponent $J^0_\phi$, i.e., $J^0_{\phi^H} \ge \eta$
and

$$J^1_{\phi^H} = \sup\{J^1_\phi : \text{subject to } J^0_\phi \ge \eta\}.$$

2) The asymptotic variance of the Hoeffding test depends on
the size of the observation space. When $Z_1^n$ has marginal
$\pi^0$, we have

$$\lim_{n\to\infty} \mathrm{Var}[n D(\Gamma^n \| \pi^0)] = \tfrac{1}{2}(|\mathsf{Z}| - 1).$$

Theorem 1.1 is a summary of results from [2], [3]. The
second result can be derived from [4], [5], [6]. It has been
demonstrated in [3] that the variance implies a drawback of
the Hoeffding test, hidden in the analysis of the error exponent:
Although asymptotically optimal, this test is not effective when
the size of the observation space is large compared to the
number of observations.

B. Mismatched Universal Test 

It was demonstrated in [3] that the potentially large variance
in the Hoeffding test can be addressed by using a generalization
of the Hoeffding test called the mismatched universal test,
which is based on the relaxation of KL divergence introduced
in [7]. The name of the mismatched divergence comes from
literature on mismatched decoding [8]. The mismatched universal
test enjoys several advantages:

1) It has smaller variance.

2) It can be designed to be robust to errors in the knowledge
of $\pi^0$.

3) It allows us to incorporate into the test partial knowledge
about $\pi^1$ (see Lemma 2.1), as well as other considerations
such as the heterogeneous cost of incorrect decisions.

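The mismatched universal test is based on the following
variational representation of KL divergence,

$$D(\mu \| \pi) = \sup_{f} \big(\langle \mu, f \rangle - \log(\langle \pi, e^{f} \rangle)\big), \qquad (1)$$

where the optimization is taken over all functions $f : \mathsf{Z} \to \mathbb{R}$.
The supremum is achieved by the log-likelihood ratio.

The mismatched divergence is defined by restricting the
supremum in (1) to a function class $\mathcal{F}$:

$$D^{\mathrm{MM}}_{\mathcal{F}}(\mu \| \pi) := \sup_{f \in \mathcal{F}} \big(\langle \mu, f \rangle - \log(\langle \pi, e^{f} \rangle)\big). \qquad (2)$$

The associated mismatched universal test is defined as

$$\phi^{\mathrm{MM}}_n = \mathbb{I}\{D^{\mathrm{MM}}(\Gamma^n \| \pi^0) \ge \eta\}.$$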

In this paper we restrict to the special case of a linear
function class: $\mathcal{F} = \{f_r := \sum_{i=1}^{d} r_i \psi_i\}$, where $\{\psi_i\}$ is a set of
basis functions, and $r$ ranges over $\mathbb{R}^d$. We assume throughout
the paper that $\{\psi_i\}$ is minimal, i.e., $\{1, \psi_1, \dots, \psi_d\}$ are
linearly independent. The basis functions can be interpreted
as features for the universal test. In this case, the definition
(2) reduces to the convex program

$$D^{\mathrm{MM}}(\mu \| \pi) = \sup_{r \in \mathbb{R}^d} \big(\langle \mu, f_r \rangle - \log(\langle \pi, e^{f_r} \rangle)\big).$$

The asymptotic variance of the mismatched universal test is
proportional to the dimension $d$ of the function class instead
of $|\mathsf{Z}| - 1$ as seen in the Hoeffding test:

$$\lim_{n\to\infty} \mathrm{Var}[n D^{\mathrm{MM}}(\Gamma^n \| \pi^0)] = \tfrac{1}{2} d,$$

when $Z_1^n$ has marginal $\pi^0$ [3]. In this way we can expect substantial
variance reduction by choosing a small $d$. The function
class also determines how well the mismatched divergence
$D^{\mathrm{MM}}(\pi^1 \| \pi^0)$ approximates the KL divergence $D(\pi^1 \| \pi^0)$ for
possible alternate distributions $\pi^1$, and thus the error exponent
of the mismatched universal test. In sum, the choice of the
basis functions $\{\psi_i\}$ is critical for successful implementation
of the mismatched universal test. The goal of this paper is to
develop algorithms to construct a suitable basis.
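
To make the convex program for a linear function class concrete, the sketch below maximizes $\langle \mu, f_r \rangle - \log\langle \pi, e^{f_r} \rangle$ over $r$ by plain gradient ascent. It is a minimal illustration and not the method of [3]: the fixed step size, the iteration count, and the function name are assumptions, chosen for small alphabets with bounded basis functions.

```python
import math

def mismatched_divergence(mu, pi, basis, step=1.0, iters=5000):
    """Approximate D^MM(mu || pi) for the linear class spanned by `basis`.

    basis[j][z] is the value psi_j(z); returns (value, r) after `iters`
    gradient-ascent steps started from r = 0.
    """
    d, m = len(basis), len(pi)

    def tilt(r):
        # f_r(z) and the unnormalized twisted weights pi(z) * exp(f_r(z))
        f = [sum(r[j] * basis[j][z] for j in range(d)) for z in range(m)]
        w = [p * math.exp(v) for p, v in zip(pi, f)]
        return f, w

    r = [0.0] * d
    for _ in range(iters):
        f, w = tilt(r)
        s = sum(w)
        # gradient_j = <mu, psi_j> - <twisted pi, psi_j>
        grad = [sum((mu[z] - w[z] / s) * basis[j][z] for z in range(m))
                for j in range(d)]
        r = [rj + step * g for rj, g in zip(r, grad)]

    f, w = tilt(r)
    value = sum(a * v for a, v in zip(mu, f)) - math.log(sum(w))
    return value, r
```

With the single basis function $\psi = \log(\mu/\pi)$ the returned value matches $D(\mu\|\pi)$, in line with the remark that the supremum in (1) is achieved by the log-likelihood ratio; with any other basis the value cannot exceed the KL divergence.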

C. Contributions of this paper 

In this paper we propose a framework to design the function
class $\mathcal{F}$, which allows us to make the tradeoff between the error
exponent and variance. One of the motivations comes from
results presented in Section II on the maximum number of
$\varepsilon$-distinguishable distributions in an exponential family, which
suggest that it is possible to use approximately $d = \log(p)$
basis functions to design a test that is effective against $p$
different distributions. In Section III we cast the feature
extraction problem as a rank-constrained optimization problem,
and propose a gradient-based algorithm with a provable local
convergence property to solve it.



The construction of a basis studied in this paper is a par- 
ticular case of the feature extraction problems that have been 
studied in many other contexts. In particular, the framework in 
this paper is connected to the exponential family PCA setting 
of [10]. The most significant difference between this work
and the exponential PCA is that our framework finds features 
that capture the difference between distributions, and the latter 
finds features that are common to the distributions considered. 

The mismatched divergence using empirical distributions
can be interpreted as an estimator of KL divergence. To improve
upon the Hoeffding test, we may apply other estimators,
such as those using data-dependent features [11], [12], or those
motivated by source-coding techniques [13] and others [14].
Our approach is different from them in that we exploit the
limited possibilities of alternate distributions.

II. Distinguishable Distributions 

The quality of the approximation of KL divergence using 
the mismatched divergence depends on the dimension of the 
function class. The goal of this section is to quantify this 
statement. 

A. Mismatched Divergence and Exponential Family 

We first describe a simple result suggesting how a basis
might be chosen given a finite set of alternate distributions, so
that the mismatched divergence is equal to the KL divergence
for those distributions:

Lemma 2.1: For any $p$ possible alternate distributions
$\{\pi^1, \pi^2, \dots, \pi^p\}$, absolutely continuous with respect to $\pi^0$,
there exist $d = p$ basis functions $\{\psi_1, \dots, \psi_d\}$ such that
$D^{\mathrm{MM}}(\pi^i \| \pi^0) = D(\pi^i \| \pi^0)$ for each $i$. These functions can be
chosen to be the log-likelihood ratios $\{\psi_i = \log(\pi^i/\pi^0)\}$. □

It is overly pessimistic to say that given $p$ distributions we
require $d = p$ basis functions. In fact, Lemma 2.2 demonstrates
that if all $p$ distributions are in the same $d$-dimensional
exponential family, then $d$ basis functions suffice. We first
recall the definition of an exponential family: For a function
class $\mathcal{F}$ and a distribution $\nu$, the exponential family $\mathcal{E}(\nu, \mathcal{F})$
is defined as:

$$\mathcal{E}(\nu, \mathcal{F}) = \Big\{\mu : \mu(z) = \frac{\nu(z) e^{f(z)}}{\langle \nu, e^{f} \rangle},\ f \in \mathcal{F}\Big\}.$$

We will restrict to the case of a linear function class, and we
say that the exponential family is $d$-dimensional if this is the
dimension of the function class $\mathcal{F}$. The following lemma is a
reinterpretation of Lemma 2.1 for the exponential family:

Lemma 2.2: Consider any $p + 1$ mutually absolutely continuous
distributions $\{\pi^i : 0 \le i \le p\}$. Then $D^{\mathrm{MM}}(\pi^i \| \pi^j) =
D(\pi^i \| \pi^j)$ for all $i \neq j$ if and only if $\pi^i \in \mathcal{E}(\pi^0, \mathcal{F})$ for all $i$.

B. Distinguishable Distributions 

Except in trivial cases, there are infinitely many
distributions in an exponential family. In order to characterize
the differences between exponential families of different
dimensions, we consider a subset of distributions which
we call $\varepsilon$-distinguishable distributions.



The motivation comes from the fact that the KL divergences
between two distributions are infinite if neither is absolutely
continuous with respect to the other, in which case we say
they are distinguishable. When the distributions are
distinguishable, we can design a test that achieves infinite error
exponents. For example, consider two distributions $\pi^0, \pi^1$ on
$\mathsf{Z} = \{z_1, z_2, z_3\}$: $\pi^0(z_1) = \pi^0(z_2) = 0.5$; $\pi^1(z_2) = \pi^1(z_3) =
0.5$. It is easy to see that the two error exponents of the test
$\phi_n(Z_1^n) = \mathbb{I}\{\Gamma^n(z_3) > 0.2\}$ are both infinite. It is then natural
to ask: Given $p$ distributions that are pairwise distinguishable,
how many basis functions do we need to design a test that is
effective for them?

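Distributions in an exponential family must have the same
support. We thus consider distributions that are approximately
distinguishable, which leads to the definitions listed below.
Consider the set-valued function $F^{\varepsilon}$ parametrized by $\varepsilon > 0$,

$$F^{\varepsilon}(x) := \{z : x(z) \ge \max_{z} x(z) - \varepsilon\}.$$

• Two distributions $\pi^1, \pi^2$ are $\varepsilon$-distinguishable if $F^{\varepsilon}(\pi^1) \setminus
F^{\varepsilon}(\pi^2) \neq \emptyset$ and $F^{\varepsilon}(\pi^2) \setminus F^{\varepsilon}(\pi^1) \neq \emptyset$.

• A distribution $\pi$ is called $\varepsilon$-extremal if $\pi(F^{\varepsilon}(\pi)) \ge 1 - \varepsilon$,
and a set of distributions $A$ is called $\varepsilon$-extremal if every
$\pi \in A$ is $\varepsilon$-extremal.

• For an exponential family $\mathcal{E}$, the integer $N(\mathcal{E})$ is defined
as the maximum $N$ such that there exists an $\varepsilon_0 > 0$ such
that for any $0 < \varepsilon < \varepsilon_0$, there exists an $\varepsilon$-extremal $A \subset \mathcal{E}$
such that $|A| \ge N$ and any two distributions in $A$ are
$\varepsilon$-distinguishable.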
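
The first two definitions above translate directly into code. The following sketch is illustrative only: the function names are hypothetical, distributions are lists indexed by the points of the observation space, and the inequalities are taken to be non-strict, which is an assumption about the intended definition.

```python
def peak_set(x, eps):
    """F^eps(x): points whose value is within eps of the maximum of x."""
    top = max(x)
    return {z for z, v in enumerate(x) if v >= top - eps}

def is_distinguishable(pi1, pi2, eps):
    """True if each peak set contains a point that the other one lacks."""
    f1, f2 = peak_set(pi1, eps), peak_set(pi2, eps)
    return bool(f1 - f2) and bool(f2 - f1)

def is_extremal(pi, eps):
    """True if pi puts mass at least 1 - eps on its own peak set."""
    return sum(pi[z] for z in peak_set(pi, eps)) >= 1.0 - eps
```

For instance, with eps = 0.05 the distributions [0.98, 0.01, 0.01] and [0.01, 0.98, 0.01] are both extremal and are distinguishable from each other, while neither is distinguishable from the uniform distribution, whose peak set is the whole space.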

One interpretation of the final definition is that the test using
a function class $\mathcal{F}$ is effective against $N(\mathcal{E})$ distributions, in
the sense that the error exponents for the mismatched universal
test are the same as for the Hoeffding test, where $\mathcal{E} = \mathcal{E}(\nu, \mathcal{F})$:

Lemma 2.3: Consider a function class $\mathcal{F}$ and its associated
exponential family $\mathcal{E} = \mathcal{E}(\nu, \mathcal{F})$, where $\nu$ has full support,
and define $N = N(\mathcal{E}(\nu, \mathcal{F}))$. Then, there exists a sequence
$\{A^{(k)} : k \ge 1\}$, such that for each $k$ the set
$A^{(k)} \subset \mathcal{E}$ consists of $N$ distributions,

$$D^{\mathrm{MM}}(\pi \| \pi') = D(\pi \| \pi') \quad \text{for any } \pi, \pi' \in A^{(k)},$$

and

$$\lim_{k\to\infty} \min_{\pi \neq \pi' \in A^{(k)}} D^{\mathrm{MM}}(\pi \| \pi') = \infty.$$

□

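Let $\mathcal{P}(d)$ denote the collection of all $d$-dimensional exponential
families. Define $N(d) = \max_{\mathcal{E} \in \mathcal{P}(d)} N(\mathcal{E})$. In the next
result we give lower and upper bounds on $N(d)$, which imply
that $N(d)$ depends exponentially on $d$:

Proposition 2.4: The maximum $N(d) = \max_{\mathcal{E}} N(\mathcal{E})$ admits
the following lower and upper bounds:

$$N(d) \ge \exp\big(\lfloor \tfrac{d}{2} \rfloor \big[\log(|\mathsf{Z}|) - \log \lfloor \tfrac{d}{2} \rfloor - 1\big]\big) \qquad (3)$$

$$N(d) \le \exp\big((d+1)(1 + \log(|\mathsf{Z}|) - \log(d+1))\big) \qquad (4)$$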



It is important to point out that N{d) is exponential in d. 
This answers the question asked at the beginning of this sec- 
tion: There exist p approximately distinguishable distributions 
for which we can design an effective mismatched test using 
approximately log(p) basis functions. 
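
For a sense of scale, the two bounds of Proposition 2.4 can be evaluated numerically. The sketch below assumes the reading of (3) with $\lfloor d/2 \rfloor$ and of (4) with $d+1$, requires $d \ge 2$, and uses hypothetical function names; it does not compute $N(d)$ itself.

```python
import math

def lower_bound(d, alphabet_size):
    """Right-hand side of (3): exp(floor(d/2) * (log|Z| - log floor(d/2) - 1))."""
    h = d // 2  # floor(d/2); must be at least 1
    return math.exp(h * (math.log(alphabet_size) - math.log(h) - 1.0))

def upper_bound(d, alphabet_size):
    """Right-hand side of (4): exp((d+1) * (1 + log|Z| - log(d+1)))."""
    return math.exp((d + 1) * (1.0 + math.log(alphabet_size) - math.log(d + 1)))
```

Equivalently, the lower bound is $(|\mathsf{Z}| / (e \lfloor d/2 \rfloor))^{\lfloor d/2 \rfloor}$ and the upper bound is $(e|\mathsf{Z}|/(d+1))^{d+1}$, so both grow geometrically in $d$ when $|\mathsf{Z}|$ is large relative to $d$.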

III. Feature Extraction via 
Rank-constrained Optimization 

Suppose that it is known that the alternate distributions can
take on $p$ possible values, denoted by $\pi^1, \pi^2, \dots, \pi^p$. Our goal
is to choose the function class $\mathcal{F}$ of dimension $d$ so that the
mismatched divergence approximates the KL divergence for
these alternate distributions, while at the same time keeping the
variance small in the associated universal test. The choice of
$d$ gives the tradeoff between the quality of the approximation
and the variance in the mismatched universal test. We assume
that $0 < D(\pi^i \| \pi^0) < \infty$ for all $i$.

We propose to use the solution to the following problem as
the function class:

$$\max_{\mathcal{F}} \Big\{ \frac{1}{p} \sum_{i=1}^{p} \gamma^i D^{\mathrm{MM}}_{\mathcal{F}}(\pi^i \| \pi^0) : \dim(\mathcal{F}) \le d \Big\} \qquad (5)$$

where $\dim(\mathcal{F})$ is the dimension of the function class $\mathcal{F}$. The
weights $\{\gamma^i\}$ can be chosen to reflect the importance of
different alternate distributions. This can be rewritten as the
following rank-constrained optimization problem:

$$\max_{X} \ \frac{1}{p} \sum_{i=1}^{p} \gamma^i \big(\langle \pi^i, X_i \rangle - \log(\langle \pi^0, e^{X_i} \rangle)\big) \quad \text{subject to } \mathrm{rank}(X) \le d \qquad (6)$$

where the optimization variable $X$ is a $p \times |\mathsf{Z}|$ matrix, and $X_i$
is the $i$th row of $X$, interpreted as a function on $\mathsf{Z}$. Given an
optimizer $X^*$, we choose $\{\psi_i\}$ to be the set of right singular
vectors of $X^*$ corresponding to nonzero singular values.

A. Algorithm 

The optimization problem in (6) is not a convex problem
since it has a rank constraint. It is generally very difficult
to design an algorithm that is guaranteed to find a global
maximum. The algorithm proposed in this paper is a
generalization of the Singular Value Projection (SVP) algorithm of
[15] designed to solve a low-rank matrix completion problem.
It is globally convergent under certain conditions valid for
matrix completion problems. However, in this prior work the
objective function is quadratic; we are not aware of any prior
work generalizing these algorithms to the case of a general
convex objective function.

Let $h(X)$ denote the objective function of (6). Let $S$ denote
the set of matrices satisfying $\mathrm{rank}(X) \le d$. Let $P_S$ denote
the projection onto $S$:

$$P_S(Y) = \arg\min\{\|Y - X\| : \mathrm{rank}(X) \le d\},$$

where we use $\|\cdot\|$ to denote the Frobenius norm. The algorithm
proposed here is defined as the following iterative gradient
projection:

1) $Y^{k+1} = X^k + \alpha^k \nabla h(X^k)$.

2) $X^{k+1} = P_S(Y^{k+1})$.

The projection step is solved by keeping only the $d$ largest
singular values of $Y^{k+1}$. The iteration is initialized with some
arbitrary $X^0$ and is stopped when $\|X^{k+1} - X^k\| \le \epsilon$ for
some small $\epsilon > 0$.

¹In practice the possible alternate distributions will likely take on a
continuum of possible values. It is our wishful thinking that we can choose
a finite approximation with $p$ distributions, and choose $d$ much smaller than
$p$, and the resulting mismatched universal test will be effective against all
alternate distributions. Validation of this optimism will be left to future work.
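
The gradient projection iteration can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions and not the authors' implementation: the alphabet is small, the initial point is $X^0 = 0$, the step size `alpha` is fixed, and the rank-$d$ projection is obtained from the top eigenvectors of $Y Y^T$ computed by a hand-written cyclic Jacobi routine (a stand-in for a library SVD). All function names are hypothetical.

```python
import math

def jacobi_eigh(A, sweeps=100, tol=1e-20):
    """Eigenvalues and eigenvectors (columns of V) of a small symmetric matrix."""
    n = len(A)
    A = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                theta = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for M in (A, V):  # column rotation: M <- M J
                    for k in range(n):
                        mp, mq = M[k][p], M[k][q]
                        M[k][p], M[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):  # row rotation: A <- J^T A
                    ap, aq = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * ap - s * aq, s * ap + c * aq
    return [A[i][i] for i in range(n)], V

def project_rank(Y, d):
    """Best rank-d approximation of Y in Frobenius norm: U_d U_d^T Y."""
    p, m = len(Y), len(Y[0])
    G = [[sum(Y[i][z] * Y[j][z] for z in range(m)) for j in range(p)] for i in range(p)]
    w, V = jacobi_eigh(G)
    top = sorted(range(p), key=lambda i: -w[i])[:d]  # top-d left singular vectors
    P = [[sum(V[i][k] * V[j][k] for k in top) for j in range(p)] for i in range(p)]
    return [[sum(P[i][j] * Y[j][z] for j in range(p)) for z in range(m)] for i in range(p)]

def objective(X, pis, pi0, gammas):
    """h(X) = (1/p) sum_i gamma_i (<pi^i, X_i> - log <pi^0, exp(X_i)>)."""
    total = 0.0
    for x, pi, g in zip(X, pis, gammas):
        log_mgf = math.log(sum(q * math.exp(v) for q, v in zip(pi0, x)))
        total += g * (sum(a * v for a, v in zip(pi, x)) - log_mgf)
    return total / len(X)

def gradient(X, pis, pi0, gammas):
    """Row i of the gradient is (gamma_i / p) * (pi^i - twisted pi^0 under X_i)."""
    grad = []
    for x, pi, g in zip(X, pis, gammas):
        w = [q * math.exp(v) for q, v in zip(pi0, x)]
        s = sum(w)
        grad.append([g / len(X) * (a - wi / s) for a, wi in zip(pi, w)])
    return grad

def gradient_projection(pis, pi0, gammas, d, alpha, iters=500, eps=1e-10):
    """Iterate Y = X + alpha * grad h(X), X = P_S(Y), starting from X = 0."""
    X = [[0.0] * len(pi0) for _ in pis]
    for _ in range(iters):
        G = gradient(X, pis, pi0, gammas)
        Y = [[x + alpha * g for x, g in zip(xr, gr)] for xr, gr in zip(X, G)]
        X_new = project_rank(Y, d)
        step = math.sqrt(sum((a - b) ** 2 for ra, rb in zip(X_new, X) for a, b in zip(ra, rb)))
        X = X_new
        if step <= eps:  # stopping rule ||X^{k+1} - X^k|| <= eps
            break
    return X
```

When $d$ equals the number of rows the projection is the identity and the iteration reduces to ordinary gradient ascent on a concave function; for smaller $d$ only convergence to a local maximum should be expected, and the result may depend on the initial point.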

B. Convergence Result 

We can establish local convergence: 

Proposition 3.1: Suppose $\bar{X}$ satisfies $\mathrm{rank}(\bar{X}) = d$ and is a
local maximum, i.e., there exists $\delta > 0$ such that for any matrix
$X \in S$ satisfying $\|X - \bar{X}\| \le \delta$, we have $h(\bar{X}) \ge h(X)$.
Choose $\alpha^k = \alpha$ for all $k$ where $0 < \alpha < 2/(\frac{1}{p} \max_i \gamma^i)$. Then
there exists a $\delta' > 0$ such that if $X^0$ satisfies $\|X^0 - \bar{X}\| \le \delta'$
and $\mathrm{rank}(X^0) \le d$, then $X^k \to \bar{X}$ as $k \to \infty$. Moreover, the
convergence is geometric. □

Let $\mathcal{H}$ denote the hyperplane $\mathcal{H} = \{\bar{X} W_1 + W_2 \bar{X} : W_1 \in
\mathbb{R}^{|\mathsf{Z}| \times |\mathsf{Z}|}, W_2 \in \mathbb{R}^{p \times p}\}$. The main idea of the proof is that near
$\bar{X}$ the set $S$ can be approximated by this hyperplane $\mathcal{H}$, as
demonstrated in Lemma 3.2.

Lemma 3.2: There exist $\delta > 0$ and $M > 0$ such that: 1) for
any $X \in S$ satisfying $\|X - \bar{X}\| \le \delta$, there exists $Z \in \mathcal{H}$ such
that $\|Z - X\| \le M\|X - \bar{X}\|^2$; 2) for any $Z \in \mathcal{H}$ satisfying
$\|Z - \bar{X}\| \le \delta$, there exists $X \in S$ satisfying $\|X - Z\| \le
M\|Z - \bar{X}\|^2$.

Let $Z^k = P_{\mathcal{H}}(Y^k)$, i.e., the projection of $Y^k$ onto $\mathcal{H}$. We
obtain from Lemma 3.2 that $Z^k$ is close to $X^k$ as follows:

Lemma 3.3: Consider any $\bar{X}$ satisfying $\mathrm{rank}(\bar{X}) = d$.
There exist $\delta > 0$ and $M > 0$ such that if $\|Z^k - \bar{X}\| \le \delta$,
then $\|Z^k - X^k\| \le M\|Y^k - \bar{X}\|^2$.

Lemma 3.4: Gradients of $h(X)$ are Lipschitz with constant
$L = \frac{1}{p} \max_i \gamma^i$, i.e., $\|\nabla h(X_1) - \nabla h(X_2)\| \le L\|X_1 - X_2\|$.

Lemma 3.5: Suppose $\bar{X}$ is a local maximum in $S$ and
$\mathrm{rank}(\bar{X}) = d$. Then $\bar{X}$ is also a local maximum in $\mathcal{H}$.

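Outline of Proof of Proposition 3.1: Using standard
results from optimization theory, we can prove that for any
small enough $\delta > 0$, if $\|X^k - \bar{X}\| \le \delta$ and $\alpha < 2/L$, then
$\|Z^{k+1} - \bar{X}\| \le q\|X^k - \bar{X}\|$ for some $q < 1$, where $q$ could
depend on $\delta$, and $\|Y^{k+1} - \bar{X}\| \le \|X^k - \bar{X}\|$. Thus, we can
choose a $\delta$ small enough so that $M\delta < \frac{1}{2}(1 - q)$. With this choice,
we have

$$\begin{aligned}
\|X^{k+1} - \bar{X}\| &\le \|Z^{k+1} - \bar{X}\| + \|Z^{k+1} - X^{k+1}\| \\
&\le \|Z^{k+1} - \bar{X}\| + M\delta\|Y^{k+1} - \bar{X}\| \\
&\le \big(q + \tfrac{1}{2}(1 - q)\big)\|X^k - \bar{X}\|.
\end{aligned}$$

Proposition 3.1 then follows from induction. ■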

IV. Simulations 

We consider probability distributions in an exponential
family of the form $\pi^i(z) = \exp\{\sum_{k=1}^{q} \theta_{i,k} \psi_k(z) +
\sum_{k=1}^{q'} \theta'_{i,k} \psi'_k(z)\}$. We first randomly generate $\{\psi_k\}$ and $\{\psi'_k\}$
to fix the model. A distribution is obtained by randomly
generating $\{\theta_{i,k}\}$ and $\{\theta'_{i,k}\}$ according to uniform distributions
on $[-1, 1]$ and $[-0.1, 0.1]$, respectively. In application of the
algorithm presented in Section III-A, the bases $\{\psi_k\}$ and $\{\psi'_k\}$
are not given. This model can be interpreted as a perturbation
to a $q$-dimensional exponential family with basis $\{\psi_k\}$.

Fig. 1: Dashed curve: average of $D^{\mathrm{MM}}(\mu^i \| \pi^0)/D(\mu^i \| \pi^0)$. Solid curve:
average of $D^{\mathrm{MM}}(\pi^i \| \pi^0)/D(\pi^i \| \pi^0)$.

In the experiment we have two phases: In the feature extraction
(training) phase, we randomly generate $p + 1$ distributions,
taken as $\pi^0, \dots, \pi^p$. We then use our techniques in (5) with the
proposed algorithm to find the function class $\mathcal{F}$. The weights
$\gamma^i$ are chosen as $\gamma^i = 1/D(\pi^i \| \pi^0)$ so that the objective value
is no larger than 1. In the testing phase, we randomly generate
$t$ distributions, denoted by $\mu^1, \dots, \mu^t$. We then compute the
average of $D^{\mathrm{MM}}(\mu^i \| \pi^0)/D(\mu^i \| \pi^0)$.

For the experimental results shown in Figure 1, the parameters
are chosen as $q = 8$, $q' = 5$, and $t = 500$. Shown in the
figure is an average of $D^{\mathrm{MM}}(\pi^i \| \pi^0)/D(\pi^i \| \pi^0)$ (for training)
as well as $D^{\mathrm{MM}}(\mu^i \| \pi^0)/D(\mu^i \| \pi^0)$ (for testing) for two cases:
$p = 50$ and $p = 500$. We observe the following:

1) The objective value increases gracefully as $d$ increases.
For $d \ge 7$, the values are close to 1.

2) The curves for training and testing are closer when $p$ is
larger, which is expected.

V. Conclusions 

The main contribution of this paper is a framework to
address the feature extraction problem for universal hypothesis
testing, cast as a rank-constrained optimization problem. This
is motivated by results on the number of easily distinguishable
distributions, which demonstrate that it is possible to use a
small number of features to design effective universal tests
for a large number of possible distributions. We propose a
gradient-based algorithm to solve the rank-constrained
optimization problem, and the algorithm is proved to converge
locally. Directions considered in current research include:
applying the nuclear-norm heuristic [16] to solve the optimization
problem (5), applying this framework to real-world data,
and extending this framework to incorporate other forms of
partial information.

Appendix 

A. Proof of the lower bound in Proposition 2.4

We give a constructive proof of the lower bound (3) by
combining ideas in Lemma A.1 and Lemma A.2.

Lemma A.1: $N(2) \ge |\mathsf{Z}|$.



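Proof: We pick the following two basis functions $\psi_1, \psi_2$:

$$\psi_1 = [|\mathsf{Z}| - 1, |\mathsf{Z}| - 2, \dots, 0], \quad \text{and} \quad \psi_2 = \Big[1, 1.5, \sum_{i=0}^{2} 2^{-i}, \dots, \sum_{i=0}^{|\mathsf{Z}|-1} 2^{-i}\Big]. \qquad (7)$$

For $1 \le k \le |\mathsf{Z}|$, define $\nu^k$ as $\nu^k = \psi_1 + 2^{k - 1/2} \psi_2$.
Assuming without loss of generality that $\mathsf{Z} = \{1, \dots, |\mathsf{Z}|\}$,
we have $\arg\max_z \nu^k(z) = k$.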

Now, for any $\beta > 0$, $1 \le k \le |\mathsf{Z}|$, define the distribution

$$\pi^{k,\beta}(z) = C \exp\{\beta \nu^k(z)\},$$

where $C$ is a normalizing constant. Since there are only finitely many
choices of $k$, for any small enough $\varepsilon$, there exists $\beta_0$ such that
for $\beta \ge \beta_0$, $\{\pi^{k,\beta}, 1 \le k \le |\mathsf{Z}|\}$ are $\varepsilon$-extremal and any two
distributions in $\{\pi^{k,\beta}, 1 \le k \le |\mathsf{Z}|\}$ are $\varepsilon$-distinguishable. ■
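
The construction in the proof of Lemma A.1 can be checked numerically on a small alphabet. The sketch below is illustrative and rests on assumptions: the coefficient of $\psi_2$ is taken to be $2^{k-1/2}$ (the source formula is only partly legible, and this value makes the maximizer of $\nu^k$ unique), the observation space is $\{1, \dots, m\}$ stored at list indices $0, \dots, m-1$, and all function names are hypothetical.

```python
import math

def basis_functions(m):
    """psi_1 = [m-1, ..., 0] and psi_2(z) = sum_{i=0}^{z-1} 2^{-i} for z = 1..m."""
    psi1 = [float(m - z) for z in range(1, m + 1)]
    psi2 = [sum(2.0 ** -i for i in range(z)) for z in range(1, m + 1)]
    return psi1, psi2

def nu(k, m):
    """nu^k = psi_1 + 2^(k - 1/2) * psi_2 (assumed coefficient)."""
    psi1, psi2 = basis_functions(m)
    c = 2.0 ** (k - 0.5)
    return [a + c * b for a, b in zip(psi1, psi2)]

def tilted(k, beta, m):
    """pi^{k,beta}(z) proportional to exp(beta * nu^k(z)), computed stably."""
    v = nu(k, m)
    top = max(v)
    w = [math.exp(beta * (x - top)) for x in v]
    s = sum(w)
    return [x / s for x in w]

def peak_set(x, eps):
    """F^eps(x): points whose value is within eps of the maximum of x."""
    top = max(x)
    return {z for z, val in enumerate(x) if val >= top - eps}
```

Under these assumptions the increment $\nu^k(z+1) - \nu^k(z) = 2^{k - 1/2 - z} - 1$ is positive for $z < k$ and negative for $z \ge k$, so $\nu^k$ peaks at $z = k$; for large $\beta$ each $\pi^{k,\beta}$ then concentrates on a different point, which gives $|\mathsf{Z}|$ pairwise distinguishable, extremal distributions from two basis functions.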
Lemma A.2: N{d) > {^//^^) 

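Proof: Take $\psi_k(z) = \mathbb{I}\{z = k\}$ for $1 \le k \le d$. ■

Outline of proof of the lower bound: The basis functions
used in the construction are the Kronecker products of basis
functions used for Lemma A.2 and Lemma A.1.

Let $J = \lfloor |\mathsf{Z}| / \lfloor \tfrac{1}{2} d \rfloor \rfloor$. Let $\bar\psi_1, \bar\psi_2$ denote the basis functions
defined in (7) with $|\mathsf{Z}|$ replaced by $J$. The basis functions used
for the lower bound are given by

$$\psi_k(i + jJ) = \mathbb{I}\{j = k - 1\}\, \bar\psi_1(i), \quad \text{for } 1 \le k \le \lfloor \tfrac{1}{2} d \rfloor,$$

$$\psi_{k + \lfloor d/2 \rfloor}(i + jJ) = \mathbb{I}\{j = k - 1\}\, \bar\psi_2(i), \quad \text{for } 1 \le k \le \lfloor \tfrac{1}{2} d \rfloor.$$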



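B. Proof of the upper bound in Proposition 2.4

The main idea of the proof of (4) is to relate this bound to
VC dimension. We first obtain an elementary upper bound.

Lemma A.3: $N(\mathcal{E}) \le \bar{N}(\mathcal{E})$, where

$$\bar{N}(\mathcal{E}) = \Big|\Big\{F^{\varepsilon}\Big(\sum_i r_i \psi_i\Big) : r \in \mathbb{R}^d, \varepsilon > 0\Big\}\Big|.$$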



Proof: By definition if a subset $A$ of $\mathcal{E}$ is $\varepsilon$-extremal, and
any two distributions in $A$ are $\varepsilon$-distinguishable, then for any
two distributions $\pi^i, \pi^j \in A$, there exist $\varepsilon_1, \varepsilon_2 > 0$ such that
$F^{\varepsilon_1}(\log(\pi^i)) \neq F^{\varepsilon_2}(\log(\pi^j))$. ■

Let $\mathbb{H}$ denote the set of all half-spaces in $\mathbb{R}^d$, and let
$VC(\mathbb{H})$ denote the VC dimension of $\mathbb{H}$. It is known that
$VC(\mathbb{H}) = d + 1$ [17, Corollary of Theorem 1].
For any finite subset $B$ of $\mathbb{R}^d$, define $\tau(B) = |\{h \cap B : h \in
\mathbb{H}\}|$. In other words, $\tau(B)$ is the number of subsets one can
obtain by intersecting $B$ with half-spaces from $\mathbb{H}$. A bound
on $\tau(B)$ is given by Sauer's lemma:

Lemma A.4 (Sauer's Lemma): The following bound holds
whenever $|B| \ge VC(\mathbb{H})$:

$$\tau(B) \le \Big(\frac{e|B|}{VC(\mathbb{H})}\Big)^{VC(\mathbb{H})}.$$

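Consider any $d$-dimensional exponential family $\mathcal{E}$ with basis
$\{\psi_j, 1 \le j \le d\}$. Define a set of vectors $\{y^i\} \subset \mathbb{R}^d$ via

$$y^i_j = \psi_j(z_i), \qquad 1 \le i \le |\mathsf{Z}|, \quad 1 \le j \le d.$$

In other words, if we stack $\{\psi_j\}$ into a matrix so that each
$\psi_j$ is a row, then $\{y^i\}$ are the columns of this matrix. Let
$B(\mathcal{E}) = \{y^i : 1 \le i \le |\mathsf{Z}|\}$. The following lemma connects
$\tau(B(\mathcal{E}))$ to $\bar{N}(\mathcal{E})$.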

Lemma A.5: $\bar{N}(\mathcal{E}) \le \tau(B(\mathcal{E}))$.

Proof: For given $r \in \mathbb{R}^d$ and $\varepsilon > 0$, denote $I =
F^{\varepsilon}(\sum_j r_j \psi_j)$. By the definition of $F^{\varepsilon}$ we have $I = \{i :
r^T y^i \ge \sup_z(\sum_j r_j \psi_j(z)) - \varepsilon\}$. Therefore, there exists $b$ such
that $r^T y^i \ge b$ for all $i \in I$, and $r^T y^i < b$ for all $i \notin I$. That is,
$I$ is the subset of $\{y^i\}$ that lies in the half-space $\{y : r^T y \ge b\}$.
Thus, $\{y^i : i \in I\} \in \{h \cap B(\mathcal{E}) : h \in \mathbb{H}\}$. Since this holds
for any element in $\{F^{\varepsilon}(\sum_j r_j \psi_j) : r \in \mathbb{R}^d, \varepsilon > 0\}$, we obtain
the result. ■

Proof of the upper bound: We obtain (4) on combining
Lemma A.3, Lemma A.4, and Lemma A.5, together with the
identity $VC(\mathbb{H}) = d + 1$. ■

Acknowledgment: This research was partially supported 
by AFOSR under grant AFOSR FA9550-09-1-0190 and NSF 
under grants NSF CCF 07-29031 and NSF CCF 08-30776. 
Any opinions, findings, and conclusions or recommendations 
expressed in this material are those of the authors and do not 
necessarily reflect the views of the AFOSR or NSF. 

References 

[1] D. E. Denning, "An intrusion-detection model," IEEE Trans. Softw. Eng.,
vol. 13, no. 2, pp. 222-232, 1987.

[2] W. Hoeffding, "Asymptotically optimal tests for multinomial
distributions," Ann. Math. Statist., vol. 36, pp. 369-401, 1965.

[3] J. Unnikrishnan, D. Huang, S. Meyn, A. Surana, and
V. Veeravalli, "Universal and composite hypothesis testing via
mismatched divergence," submitted for publication. [Online]. Available:
http://arxiv.org/abs/0909.2234

[4] S. S. Wilks, "The large-sample distribution of the likelihood ratio for
testing composite hypotheses," Ann. Math. Statist., vol. 9, pp. 60-62,
1938.

[5] B. S. Clarke and A. R. Barron, "Information-theoretic asymptotics of
Bayes methods," IEEE Trans. Inf. Theory, vol. 36, no. 3, pp. 453-471,
May 1990.

[6] I. Csiszár and P. C. Shields, "Information theory and statistics: A
tutorial," Foundations and Trends in Communications and Information
Theory, vol. 1, no. 4, pp. 417-528, 2004.

[7] E. Abbe, M. Medard, S. Meyn, and L. Zheng, "Finding the best
mismatched detector for channel coding and hypothesis testing," in
Information Theory and Applications Workshop, 2007, 29 Feb. 2007,
pp. 284-288.

[8] N. Merhav, G. Kaplan, A. Lapidoth, and S. S. Shitz, "On information
rates for mismatched decoders," IEEE Trans. Inf. Theory, vol. 40, no. 6,
pp. 1953-1967, Nov. 1994.

[9] D. Huang, J. Unnikrishnan, S. Meyn, V. Veeravalli, and A. Surana,
"Statistical SVMs for robust detection, supervised learning, and universal
classification," in IEEE Information Theory Workshop on Networking
and Information Theory, Jun. 2009, pp. 62-66.

[10] M. Collins, S. Dasgupta, and R. E. Schapire, "A generalization of
principal component analysis to the exponential family," in Advances
in Neural Information Processing Systems, vol. 14. MIT Press, 2001,
pp. 617-624.

[11] Q. Wang, S. R. Kulkarni, and S. Verdú, "Divergence estimation of
continuous distributions based on data-dependent partitions," IEEE Trans.
Inf. Theory, vol. 51, no. 9, pp. 3064-3074, Sep. 2005.

[12] Q. Wang, S. R. Kulkarni, and S. Verdú, "Divergence estimation for
multidimensional densities via k-nearest-neighbor distances," IEEE Trans.
Inf. Theory, vol. 55, no. 5, pp. 2392-2405, May 2009.

[13] J. Ziv and N. Merhav, "A measure of relative entropy between individual
sequences with application to universal classification," IEEE Trans. Inf.
Theory, vol. 39, no. 4, pp. 1270-1279, Jul. 1993.

[14] X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Estimating divergence
functionals and the likelihood ratio by convex risk minimization,"
Department of Statistics, UC Berkeley, Tech. Rep. 764, Jan. 2007.

[15] R. Meka, P. Jain, and I. S. Dhillon, "Guaranteed rank
minimization via singular value projection," 2009. [Online]. Available:
http://www.citebase.org/abstract?id=oai:arXiv.org:0909.5457

[16] M. Fazel, H. Hindi, and S. Boyd, "A rank minimization heuristic with
application to minimum order system approximation," in Proceedings
of the American Control Conference, vol. 6, 2001, pp. 4734-4739.

[17] C. J. C. Burges, "A tutorial on Support Vector Machines for pattern
recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp.
121-167, Jun. 1998.



